-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: use core Collection and Select for sig loading #197
Conversation
…ash_plugin_branchwater into use-core-more-broadly
@ctb ok - updated description, added a test for names with spaces, and made a few doc changes. Some things you may want to think on before merge:
|
This is as opposed to before, when for some commands, like gather, it would filter them on load, right? That's ok. It joins regular sourmash in that regard 😓 sourmash-bio/sourmash#1899
Yep! OK, will look at it as soon as I can. Thanks!! |
build is reporting this; intentional? (UPDATED to use the right branch 😭 )
|
This comment was marked as outdated.
This comment was marked as outdated.
More questions / thoughts:
|
This comment was marked as outdated.
This comment was marked as outdated.
(turns out I was looking at the wrong branch 😅 - updating things above now!) |
'fastgather' hangs when using a rocksdb:
...waiting for over a minute now. |
ok, I think those are my main questions/comments now (I hid the outdated ones from using the wrong branch 😊 ) |
I disabled trying to load
Actually, it's not nearly as bad as I was thinking - I just needed to think about I think we can also solve this by allowing manifest input to avoid needing to regenerate, added in #211.
Only pathlist sig handling changed re: indexed vs non-indexed search. So, we should recommend zip files over pathlists to avoid the extra work of generating the manifest. With #211, we can recommend manifests or zip files over pathlists. |
(I think the zip file loading should win over everything else ;) |
This PR replaces signature loading code with calls to supported functions from sourmash core 0.12.1. It also standardizes csv writing for most functions and standardizes sig loading error handling.
Before this PR, signature loading used custom preparation functions to load all signatures into memory (selecting those with matching parameters). For zipfile reading, we had a hacky implementation that copied all files out of the zip into a temporary location first, which then allowed us to load them properly.
With this PR, we now use sourmash
Collection
andSelect
for signature loading. For zip files, this allows lazy loading for signatures, meaning we do all parameter selection via the manifest and only load signatures into memory when we actually want to use them. Downsampling happens at the time of signature loading, either viaSelect
or passing downsample tocount_common
or similar. For sig or pathlist input, we first load all signatures into memory and generate collection with manifest. We then continue as before, selecting on the manifest first to remove signatures that don't match user-specified parameters (ksize, scaled).Note that since default core loading of sigs and pathlists did not allow failed sigs, I reimplemented loading here to allow for skipped and failed paths.
Closes #196.
Requirements
Requires changes from:
I think we should release a new core for this and use it here prior to merge.
Functionality Changes
query
oragainst
. This is now checked and handled as part of collection loading.fastmultigather
, query-based prefetch/gather output filenames change slightly. I realized after changing over toquery.name
that we do have access torecord.internal_location()
. However, this doesn't always exist? Would appreciate your thoughts on the new structure @ctb. Can switch back if you think it's needed/better.New functions:
load_collection
: Unified signature Collection loading from zipfile, pathlist, or sigs (no more hacky zip loading!!! and use manifests for selection!)report_on_collection_loading
: report n_failed, n_skipped, n_loaded. replacesreport_on_sketch_loading
.build_selection
: replacesbuild_template
. Builds a selection object we can use to select by ksize, scaled, etcModified Utility Functions
load_sketches
- now loads all sketches from a collection into memory as SmallSignature objects, downsampling along the way. Used anywhere we load a full set of minhashes into memory (pairwise
,multisearch
,manysearch
,fastgather
,fastmultigather
)These were modified to use collections and
SigStore
/Signature
rather thanSmallSignature
for places where we load a single signature at a time:write_prefetch
load_sketches_above_threshold
consume_query_by_gather
Minor Refactoring
*Result
structures where they were missing to standardize writing and make adding columns easy*Result
objects withcsv_threadwriter
Deleted Functions
load_sigpaths_from_zip
load_sketches_from_zip
load_sigpaths_from_zip_or_pathlist
load_sketches_from_zip_or_pathlist
load_sketchlist_filenames
build_template
check_compatible_downsample
report_on_sketch_loading
prepare_query
Punted Issues
test_gather.py
test_against_multisigfile
- was solved for zips and sig files, but not pathlists. See EXP: Show multisig loading issue #202core
thoughts to issue: additional 0.12.x core thoughts sourmash#2967manysketch
use new core manifest struct and functions inmanysketch
#203Questions (Resolved in review)
Do we want functions to exit gracefully if no signatures are loaded for query or against, or do we want it to fail? It was done differently between different subcommands. I standardized to exit gracefully (as was done in fastgather), but happy to change back.